The David De Gea Dilemma: Comparing Goalkeeper Greats Throughout History
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.graph_objects as go
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
June 24th 2023
gks = ["ps","vds","cech","iker","buffon","neuer","alisson","ederson","courtois","ddg","costa","onana"]
full_name = ["Peter Schmeichel","Edwin van der Sar","Petr Cech","Iker Casillas","Gianluigi Buffon","Manuel Neuer","Alisson Becker","Ederson","Thibaut Courtois","David De Gea","Diogo Costa","Andre Onana"]
Penalty Kicks (+ shootouts)
For penalties, let's inspect some data and assess how De Gea compares with retired goalkeepers who are regarded as legends, not just at United but at other major European clubs, namely:
- Peter Schmeichel (Manchester United)
- Edwin van der Sar (Manchester United)
- Petr Cech (Chelsea)
- Iker Casillas (Real Madrid)
- Gianluigi Buffon (Juventus)
alongside these active players who have consistently performed at a high level:
- Manuel Neuer (Bayern Munich)
- Alisson Becker (Liverpool)
- Ederson (Manchester City)
- Thibaut Courtois (Real Madrid)
and finally with two keepers that have been on United and the fans' radar:
- Diogo Costa (Porto)
- Andre Onana (Inter Milan)
pk_save_rate = [3/37,11/60,17/85,23/100,39/124,22/76,14/33,7/54,14/66,14/74,11/34,7/33]
data source: Transfermarkt (e.g. Onana)
fig = go.Figure(data=[go.Bar(x=gks, y=pk_save_rate,hovertext=full_name)])
# Customize aspect
fig.update_traces(marker_color='rgb(158,202,100)', marker_line_color='rgb(8,48,107)',
marker_line_width=1.5, opacity=0.6)
fig.update_layout(title_text='Penalty Kick Save Rates among top Keepers (entire career, all comps, excluding shootouts)')
fig.show()
While De Gea's penalty-saving record is not something he can be particularly proud of, I, as a United fan, find it quite funny that his save rate is higher than those of both van der Sar and Peter Schmeichel, two United legends.
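The comparison can be checked directly from the saves-and-attempts counts behind `pk_save_rate`. The sketch below is a minimal standard-library illustration; the `pk_record` dictionary and the `save_rate` helper are names introduced here, with the counts copied from the list above.

```python
# (saves, penalties faced) copied from the pk_save_rate list above
pk_record = {
    "Peter Schmeichel": (3, 37),
    "Edwin van der Sar": (11, 60),
    "David De Gea": (14, 74),
}

def save_rate(saves, faced):
    """Share of penalties faced that the keeper saved."""
    return saves / faced

rates = {name: save_rate(*rec) for name, rec in pk_record.items()}

# print the three keepers from highest to lowest save rate
for name, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {rate:.1%}")
```

De Gea's edge over van der Sar is well under one percentage point, so it is best read as a tie rather than a clear win.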
Next, let's use KL divergence to compare this data. KL divergence, in short, lets us compare two probability distributions by taking the expectation, under the first distribution, of the log-ratio of the probabilities that the two distributions assign to each outcome.
Here, let's assume that, based on historical data, penalty takers have an 85% chance of scoring and a 15% chance of failing to score. I couldn't find a consensus figure across the data sources I checked, but the conversion rate seems to be somewhere north of 80 percent, so let's go with 85 percent for simplicity. (Some data sources set aside a separate percentage for penalties that miss the goal entirely, but I combine those with saves because, in my opinion, a miss of any kind can be attributed to the keeper. It's a mental game!)
To that end, our base distribution will be $p_{scored} = .85$ and $p_{saved}=.15$ (the implication being that the average keeper, in the context of PKs, will prevail against the shooter only 15 percent of the time), to which we will compare each goalkeeper's individual penalty kick distribution.
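For reference, with only two outcomes the divergence reduces to a two-term sum. Writing $q_{scored}$ and $q_{saved}$ for an individual keeper's rates (notation introduced here to mirror $p$), the standard definition gives:

$$D_{KL}(p \,\|\, q) = p_{scored}\,\log\frac{p_{scored}}{q_{scored}} + p_{saved}\,\log\frac{p_{saved}}{q_{saved}}$$

This quantity is never negative and equals 0 only when $q = p$.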
p = [.85,.15]
def kl_divergence(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i)
    # the built-in sum replaces the deprecated np.sum(generator) call
    return sum(p[i] * np.log(p[i]/q[i]) for i in range(len(p)))
gks_kld = []
for save_rate in pk_save_rate:
    # each keeper's own (scored, saved) distribution
    q = [1-save_rate, save_rate]
    gks_kld.append(kl_divergence(p,q))
fig1= go.Figure(data=[go.Bar(x=gks, y=gks_kld,hovertext=full_name)])
# Customize aspect
fig1.update_traces(marker_color='rgb(200,202,100)', marker_line_color='rgb(8,48,107)',
                   marker_line_width=1.5, opacity=0.6)
fig1.update_layout(title_text='KL Divergence among top Keepers (entire career in all comps, excluding shootouts)')
fig1.show()
Two identical distributions produce a KL divergence of 0, and the more similar two distributions are, the closer the KL divergence is to 0. Thus, we can infer that:
- Edwin van der Sar, Petr Cech, Ederson, and De Gea are pretty much average PK savers.
- Peter Schmeichel's higher KL divergence doesn't imply that he's better than the keepers just mentioned. KL divergence only measures how far a keeper is from the baseline, not in which direction, and the previous visualization shows that his 8 percent PK save rate is the lowest of the 12 keepers here, so he is worse than the average keeper at saving penalty kicks.
- Buffon, Costa, and Neuer are great at saving PKs, but not as great as Alisson!
(The subtle assumption here is that all penalty kicks are equally difficult, regardless of the competition, whether the keeper's team was winning or losing when it conceded the penalty, how good a penalty taker the keeper is facing, etc.)
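Because KL divergence is never negative, the bar chart alone cannot tell an above-average keeper from a below-average one. One way to keep both pieces of information, sketched here as an assumption rather than as part of the original analysis, is to attach the sign of (save rate minus baseline) to the divergence. The helper names `kl_bernoulli` and `signed_kl` are hypothetical.

```python
import math

BASE_SAVE = 0.15  # baseline save probability assumed in the text

def kl_bernoulli(p_save, q_save):
    """KL(p || q) for two (scored, saved) distributions, in nats."""
    p = [1 - p_save, p_save]
    q = [1 - q_save, q_save]
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def signed_kl(save_rate, base=BASE_SAVE):
    """KL from the baseline, negated when the keeper is below the baseline."""
    kl = kl_bernoulli(base, save_rate)
    return kl if save_rate >= base else -kl

schmeichel = signed_kl(3 / 37)   # save rate below the 15% baseline
alisson = signed_kl(14 / 33)     # save rate above the 15% baseline
print(round(schmeichel, 4), round(alisson, 4))
```

With this convention Schmeichel's bar would point downwards and Alisson's upwards, while the bar lengths stay the same as in the chart above.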
However, this analysis is based on non-shootout PKs, meaning that it excludes some historic moments such as:
- Edwin van der Sar's three penalty saves in the Community Shield (2007) against Chelsea, and the other two in the Champions League final (2008), also against Chelsea
- Petr Cech single-handedly securing Chelsea's first Champions League title against Bayern Munich in 2012 by denying Olic and Schweinsteiger.
- De Gea going zero for 11 (0/11) against Villarreal in the Europa League final shootout (2020/2021).
- Neuer denying Kaka and Ronaldo and, with the help of Ramos sending his penalty to the moon, beating Real Madrid in the Champions League semi-final (2011/2012)
(I'm having a hard time finding public data on shootouts, so I will expand on this as I manually collect the relevant data.)
Now let's look into more advanced stats that may tell us more about where De Gea stands amongst other great keepers.
I will be looking into:
- Crosses_stp%: Percentage of crosses stopped
- Post-Shot xG (PSxG) Prevention per 90: PSxG is the number of goals an average keeper would be expected to concede, given the quality of the shots the shooters actually produced. Subtracting the actual number of goals conceded therefore lets us gauge how well a keeper does at shot-stopping compared with the average keeper.
- Defensive Actions Outside of Penalty Area per 90 minutes (#OPA/90)
- Average Distance (AvgDist): Average distance from goal at which the keeper performs defensive actions, i.e. sweeping
along with passing stats, i.e. completion rate for:
- Passes between 15 ~ 30 yards
- Passes longer than 30 yards
- Passes longer than 40 yards
- All Passes
data source: FBref (e.g. link). Note that these stats are available only for current, non-retired players, so Peter Schmeichel, van der Sar, Cech, Casillas, and Buffon are not considered in this part of the analysis.
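As a concrete reading of the PSxG definition above, the per-90 figure is the difference between post-shot expected goals and goals conceded, divided by the number of 90-minute units played. The numbers below are made up purely for illustration, and `psxg_prevention_per_90` is a hypothetical helper, not an FBref function.

```python
def psxg_prevention_per_90(psxg, goals_conceded, minutes):
    """(PSxG - goals conceded) per 90 minutes; positive means better than average."""
    nineties = minutes / 90
    return (psxg - goals_conceded) / nineties

# hypothetical season: 40.0 PSxG faced, 36 goals conceded, 38 full matches
value = psxg_prevention_per_90(psxg=40.0, goals_conceded=36, minutes=38 * 90)
print(round(value, 3))
```

A keeper who concedes more than the PSxG faced gets a negative value, which is how the signed `/90` column in the FBref tables further down should be read.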
categories1 = ["Crosses_stp%","PSxG Prevention per 90","OPA/90","AvgDist","pass (15~30 yards)","pass (30 yards <)","pass (40 yards <)","Total Passing"]
categories = [*categories1, categories1[0]]
adv_stats = pd.DataFrame(np.array([[3.2,.05,2.93,23.3,98.4,66.3,48.3,86.7], [4.8,.12,2.28,18.7,98.5,61.6,44.3,85.2],[6.5,0,1.5,17.9,98.6,61.1,43.6,86.5],[6,.09,.96,14.8,98.9,50.7,32.6,80.6],[2.1,.04,.71,14.3,97.8,45.3,36.6, 70.6],[7.4,.09,1.22,16.6,98.7,53.6,40.9,77.5],
[6.4,0.06,1.21,16.2,98.4,54.7,41.3,81.0]]),
columns=categories1,index=full_name[5:])
scaler = MinMaxScaler()
# rescale each stat to [0, 1] across the seven keepers
for col in adv_stats.columns:
    adv_stats[[col]] = scaler.fit_transform(adv_stats[[col]])
adv_stats
| | Crosses_stp% | PSxG Prevention per 90 | OPA/90 | AvgDist | pass (15~30 yards) | pass (30 yards <) | pass (40 yards <) | Total Passing |
|---|---|---|---|---|---|---|---|---|
| Manuel Neuer | 0.207547 | 0.416667 | 1.000000 | 1.000000 | 0.545455 | 1.000000 | 1.000000 | 1.000000 |
| Alisson Becker | 0.509434 | 1.000000 | 0.707207 | 0.488889 | 0.636364 | 0.776190 | 0.745223 | 0.906832 |
| Ederson | 0.830189 | 0.000000 | 0.355856 | 0.400000 | 0.727273 | 0.752381 | 0.700637 | 0.987578 |
| Thibaut Courtois | 0.735849 | 0.750000 | 0.112613 | 0.055556 | 1.000000 | 0.257143 | 0.000000 | 0.621118 |
| David De Gea | 0.000000 | 0.333333 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.254777 | 0.000000 |
| Diogo Costa | 1.000000 | 0.750000 | 0.229730 | 0.255556 | 0.818182 | 0.395238 | 0.528662 | 0.428571 |
| Andre Onana | 0.811321 | 0.500000 | 0.225225 | 0.211111 | 0.545455 | 0.447619 | 0.554140 | 0.645963 |
# repeat the first stat as a final column so each radar polygon closes
adv_stats['close_form'] = adv_stats['Crosses_stp%']
fig2 = go.Figure()
opacity = .85
fig2.add_trace(go.Scatterpolar(
r=adv_stats.loc['Manuel Neuer'].values,
theta=categories,
fill='toself',
opacity=opacity,
name='Manuel Neuer',
marker_line_width=1.5,
hovertext ='Manuel Neuer'
))
fig2.add_trace(go.Scatterpolar(
r=adv_stats.loc['Alisson Becker'].values,
theta=categories,
fill='toself',
opacity=opacity,
name='Alisson Becker',
marker_line_width=.15,
hovertext='Alisson Becker'
))
fig2.add_trace(go.Scatterpolar(
r=adv_stats.loc['Ederson'].values,
theta=categories,
fill='toself',
opacity=opacity,
name='Ederson',
marker_line_width=1.5,
hovertext = 'Ederson'
))
fig2.add_trace(go.Scatterpolar(
r=adv_stats.loc['Thibaut Courtois'].values,
theta=categories,
fill='toself',
opacity=opacity,
name='Thibaut Courtois',
marker_line_width=1.5,
hovertext='Thibaut Courtois'
))
fig2.add_trace(go.Scatterpolar(
r=adv_stats.loc['Diogo Costa'].values,
theta=categories,
fill='toself',
opacity=opacity,
name='Diogo Costa',
marker_line_width=1.5,
hovertext='Diogo Costa'
))
fig2.add_trace(go.Scatterpolar(
r=adv_stats.loc['Andre Onana'].values,
theta=categories,
fill='toself',
opacity=opacity,
name='Andre Onana',
marker_line_width=1.5,
hovertext='Andre Onana'
))
fig2.add_trace(go.Scatterpolar(
r=adv_stats.loc['David De Gea'].values,
theta=categories,
fill='toself',
name='David De Gea',
marker_line_width=1.5,
hovertext='David De Gea'
))
fig2.update_layout(
polar=dict(
radialaxis=dict(
visible=False
)),
showlegend=True
)
fig2.update_layout(title_text='Comparison of (Normalized) Advanced Stats among Modern Keepers in their respective domestic leagues (2017~)',height=600)
fig2.show()
We can see that De Gea is lacking in many areas, coming in dead last among these seven keepers for every stat except PSxG Prevention per 90 and pass (40 yards <).
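Keep in mind that a 0 in the normalized stats means lowest in this seven-keeper sample, not a raw value of zero. A minimal sketch of the min-max formula $(x - x_{min}) / (x_{max} - x_{min})$, applied by hand to the raw `Crosses_stp%` values from `adv_stats`, makes this explicit; `min_max` is a helper written for this illustration and stands in for `MinMaxScaler`.

```python
keepers = ["Manuel Neuer", "Alisson Becker", "Ederson", "Thibaut Courtois",
           "David De Gea", "Diogo Costa", "Andre Onana"]
# raw Crosses_stp% values, copied from the adv_stats definition above
crosses_stopped_pct = [3.2, 4.8, 6.5, 6.0, 2.1, 7.4, 6.4]

def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled = dict(zip(keepers, min_max(crosses_stopped_pct)))
for name, s in scaled.items():
    print(f"{name}: {s:.3f}")
```

De Gea's raw 2.1% becomes 0 and Diogo Costa's 7.4% becomes 1, matching the first column of the normalized table.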
Now let's look at the 2022-2023 season across the Big 5 European Leagues
%pip install lxml
df2223 = pd.read_html('https://fbref.com/en/comps/Big5/keepersadv/players/Big-5-European-Leagues-Stats')
df = df2223[0]
cols = df.columns
df = df[[cols[1],cols[17],cols[20],cols[29],cols[32],cols[33]]]
df.head()
| | Unnamed: 1_level_0 | Expected | Launched | Crosses | Sweeper | Sweeper |
|---|---|---|---|---|---|---|
| | Player | /90 | Cmp% | Stp | #OPA/90 | AvgDist |
| 0 | Álvaro Aceves | +0.90 | 50.0 | 0 | 13.85 | 33.0 |
| 1 | Julen Agirrezabala | -0.06 | 36.6 | 11 | 1.33 | 15.4 |
| 2 | Doğan Alemdar | -0.32 | 34.3 | 5 | 1.11 | 14.8 |
| 3 | Alisson | +0.27 | 41.0 | 23 | 2.41 | 19.8 |
| 4 | Alphonse Areola | +0.09 | 37.8 | 2 | 0.29 | 10.3 |
df = df.dropna()
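# note: the 'Crosses'/'Stp' column selected above is the number of crosses stopped, not 'Stp%';
# the name cross_stop_% is kept because the later cells refer to it
df.columns = ['player','PSx90PrevPer90','40+_completion_%','cross_stop_%',"sweeper_action_90","sweep_avg_dist"]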
df.head()
| | player | PSx90PrevPer90 | 40+_completion_% | cross_stop_% | sweeper_action_90 | sweep_avg_dist |
|---|---|---|---|---|---|---|
| 0 | Álvaro Aceves | +0.90 | 50.0 | 0 | 13.85 | 33.0 |
| 1 | Julen Agirrezabala | -0.06 | 36.6 | 11 | 1.33 | 15.4 |
| 2 | Doğan Alemdar | -0.32 | 34.3 | 5 | 1.11 | 14.8 |
| 3 | Alisson | +0.27 | 41.0 | 23 | 2.41 | 19.8 |
| 4 | Alphonse Areola | +0.09 | 37.8 | 2 | 0.29 | 10.3 |
# drop the header rows that FBref repeats inside the table body
df = df[~df['PSx90PrevPer90'].isin(['/90'])]
df[['PSx90PrevPer90','40+_completion_%', 'cross_stop_%',
'sweeper_action_90', 'sweep_avg_dist']] = df[['PSx90PrevPer90','40+_completion_%', 'cross_stop_%',
'sweeper_action_90', 'sweep_avg_dist']].apply(pd.to_numeric)
df.head(195)
| | player | PSx90PrevPer90 | 40+_completion_% | cross_stop_% | sweeper_action_90 | sweep_avg_dist |
|---|---|---|---|---|---|---|
| 0 | Álvaro Aceves | 0.90 | 50.0 | 0 | 13.85 | 33.0 |
| 1 | Julen Agirrezabala | -0.06 | 36.6 | 11 | 1.33 | 15.4 |
| 2 | Doğan Alemdar | -0.32 | 34.3 | 5 | 1.11 | 14.8 |
| 3 | Alisson | 0.27 | 41.0 | 23 | 2.41 | 19.8 |
| 4 | Alphonse Areola | 0.09 | 37.8 | 2 | 0.29 | 10.3 |
| ... | ... | ... | ... | ... | ... | ... |
| 206 | Guglielmo Vicario | 0.09 | 30.3 | 34 | 0.71 | 11.5 |
| 208 | Iván Villar | -0.09 | 33.9 | 8 | 0.74 | 14.2 |
| 209 | Danny Ward | -0.21 | 31.0 | 20 | 1.62 | 15.9 |
| 210 | Axel Werner | -0.82 | 20.0 | 1 | 0.50 | 14.0 |
| 211 | Joseph Whitworth | -1.33 | 29.6 | 1 | 1.50 | 15.4 |
195 rows × 6 columns
features = ['PSx90PrevPer90','40+_completion_%', 'cross_stop_%',
'sweeper_action_90', 'sweep_avg_dist']
X = df[features]
names = df['player'].values
max_clusters = 20
ks = range(2, max_clusters+1)
clusterers = [KMeans(n_clusters=k, n_init=50, random_state=109).fit(X) for k in ks]
modern_gks =['Alisson','Ederson','Diogo Costa','David de Gea', 'André Onana','Kepa Arrizabalaga',
'Mike Maignan','Jordan Pickford','Nick Pope','Jason Steele',
'Thibaut Courtois','Dean Henderson','Hugo Lloris','Robert Sánchez','Danny Ward',
'Keylor Navas','Unai Simón','Gianluigi Donnarumma','Jan Oblak','Rui Patrício',
'Aaron Ramsdale','José Sá','Neto','Illan Meslier','Emiliano Martínez','Bernd Leno','Vicente Guaita',
'Gavin Bazunu','Łukasz Fabiański','Fraser Forster','Bernd Leno','Alex McCarthy','Daniel Iversen','Mark Travers',
'Marc-André ter Stegen','Yann Sommer']
pass_df = pd.read_html('https://fbref.com/en/comps/Big5/passing/players/Big-5-European-Leagues-Stats')
passdf = pass_df[0]
passdf.columns
MultiIndex([( 'Unnamed: 0_level_0', 'Rk'),
( 'Unnamed: 1_level_0', 'Player'),
( 'Unnamed: 2_level_0', 'Nation'),
( 'Unnamed: 3_level_0', 'Pos'),
( 'Unnamed: 4_level_0', 'Squad'),
( 'Unnamed: 5_level_0', 'Comp'),
( 'Unnamed: 6_level_0', 'Age'),
( 'Unnamed: 7_level_0', 'Born'),
( 'Unnamed: 8_level_0', '90s'),
( 'Total', 'Cmp'),
( 'Total', 'Att'),
( 'Total', 'Cmp%'),
( 'Total', 'TotDist'),
( 'Total', 'PrgDist'),
( 'Short', 'Cmp'),
( 'Short', 'Att'),
( 'Short', 'Cmp%'),
( 'Medium', 'Cmp'),
( 'Medium', 'Att'),
( 'Medium', 'Cmp%'),
( 'Long', 'Cmp'),
( 'Long', 'Att'),
( 'Long', 'Cmp%'),
('Unnamed: 23_level_0', 'Ast'),
('Unnamed: 24_level_0', 'xAG'),
('Unnamed: 25_level_0', 'xA'),
('Unnamed: 26_level_0', 'A-xAG'),
('Unnamed: 27_level_0', 'KP'),
('Unnamed: 28_level_0', '1/3'),
('Unnamed: 29_level_0', 'PPA'),
('Unnamed: 30_level_0', 'CrsPA'),
('Unnamed: 31_level_0', 'PrgP'),
('Unnamed: 32_level_0', 'Matches')],
)
passdf = passdf[[( 'Unnamed: 1_level_0', 'Player'), ('Unnamed: 3_level_0','Pos'),('Medium','Cmp%'),('Long','Cmp%'),('Unnamed: 28_level_0', '1/3')]]
passdf.head()
| | Unnamed: 1_level_0 | Unnamed: 3_level_0 | Medium | Long | Unnamed: 28_level_0 |
|---|---|---|---|---|---|
| | Player | Pos | Cmp% | Cmp% | 1/3 |
| 0 | Brenden Aaronson | MF,FW | 76.9 | 38.5 | 47 |
| 1 | Paxten Aaronson | MF,DF | 60.9 | 16.7 | 3 |
| 2 | James Abankwah | DF | 75.0 | 40.0 | 0 |
| 3 | George Abbott | MF | NaN | NaN | 0 |
| 4 | Yunis Abdelhamid | DF | 90.1 | 55.6 | 155 |
passdf.columns = ['player','pos','med_completion_rate','long_completion_rate','final_third']
passdf = passdf[passdf['pos'] == 'GK']
passdf = passdf.dropna()
passdf[['med_completion_rate',
'long_completion_rate', 'final_third']] = passdf[['med_completion_rate',
'long_completion_rate', 'final_third']].apply(pd.to_numeric)
tdf = pd.merge(df, passdf, on="player")
tdf.head()
| | player | PSx90PrevPer90 | 40+_completion_% | cross_stop_% | sweeper_action_90 | sweep_avg_dist | pos | med_completion_rate | long_completion_rate | final_third |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Álvaro Aceves | 0.90 | 50.0 | 0 | 13.85 | 33.0 | GK | 100.0 | 50.0 | 0 |
| 1 | Julen Agirrezabala | -0.06 | 36.6 | 11 | 1.33 | 15.4 | GK | 96.8 | 43.2 | 2 |
| 2 | Doğan Alemdar | -0.32 | 34.3 | 5 | 1.11 | 14.8 | GK | 100.0 | 36.4 | 2 |
| 3 | Alisson | 0.27 | 41.0 | 23 | 2.41 | 19.8 | GK | 98.6 | 58.2 | 16 |
| 4 | Alphonse Areola | 0.09 | 37.8 | 2 | 0.29 | 10.3 | GK | 100.0 | 44.2 | 0 |
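One thing to watch for in the merge above: a keeper who played for two clubs in the season has one row per club in each FBref table, so joining on `player` alone pairs every row for that name with every other row for it. The sketch below uses invented numbers and a hypothetical `squad` column (this notebook drops the squad before merging) to show the effect and one possible remedy.

```python
import pandas as pd

# toy stand-ins for the keeper and passing tables; all values are invented
keeping = pd.DataFrame({
    "player": ["Keeper A", "Keeper A", "Keeper B"],
    "squad": ["Club X", "Club Y", "Club Z"],
    "sweeper_action_90": [1.2, 0.8, 0.5],
})
passing = pd.DataFrame({
    "player": ["Keeper A", "Keeper A", "Keeper B"],
    "squad": ["Club X", "Club Y", "Club Z"],
    "long_completion_rate": [55.0, 48.0, 40.0],
})

# joining on the name only: Keeper A's 2 rows x 2 rows become 4 rows
by_name = pd.merge(keeping, passing, on="player")

# joining on name and squad keeps one row per player-club spell;
# validate="one_to_one" raises if the keys are not unique on both sides
by_name_and_squad = pd.merge(keeping, passing, on=["player", "squad"],
                             validate="one_to_one")

print(len(by_name), len(by_name_and_squad))
```

This would be consistent with the merged frame having 210 rows although the keeper table had 195, and with names such as Pierluigi Gollini appearing several times in the group listings further down.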
categories = ['PSx90PrevPer90', '40+_completion_%', 'cross_stop_%',
'sweeper_action_90', 'sweep_avg_dist',
'med_completion_rate', 'long_completion_rate', 'final_third']
# rescale every feature to [0, 1] across all keepers
for cat in categories:
    tdf[[cat]] = scaler.fit_transform(tdf[[cat]])
tdf.shape
(210, 10)
tdf.columns
Index(['player', 'PSx90PrevPer90', '40+_completion_%', 'cross_stop_%',
'sweeper_action_90', 'sweep_avg_dist', 'pos', 'med_completion_rate',
'long_completion_rate', 'final_third'],
dtype='object')
total_features = ['PSx90PrevPer90', '40+_completion_%', 'cross_stop_%',
'sweeper_action_90', 'sweep_avg_dist',
'med_completion_rate', 'long_completion_rate', 'final_third']
X = tdf[total_features]
clusterers = [KMeans(n_clusters=k, n_init=50, random_state=109).fit(X) for k in ks]
inertias = [c.inertia_ for c in clusterers]
plt.plot(ks, inertias, 'o-')
plt.xticks(ks[::2])
plt.xlabel('$k$')
plt.ylabel('inertia');
plt.suptitle('KMeans Clustering of Keepers')
plt.title('Inertia vs Number of Clusters');
sil_scores = [silhouette_score(X, c.labels_) for c in clusterers]
plt.plot(ks, sil_scores, 'o-')
plt.xticks(ks[::2])
plt.xlabel('$k$')
plt.ylabel('silhouette score')
plt.suptitle('KMeans Clustering of Keepers')
plt.title('Silhouette vs Number of Clusters');
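For intuition about what the second plot measures, the silhouette of a point compares its mean distance to its own cluster ($a$) with its mean distance to the nearest other cluster ($b$), as $(b - a) / \max(a, b)$. The sketch below computes the mean silhouette by hand on a toy one-dimensional dataset; it is a simplified stand-in for `silhouette_score`, and the function name and data are made up for illustration.

```python
def mean_silhouette(points, labels):
    """Mean silhouette for 1-D points; assumes every cluster has at least 2 points."""
    clusters = sorted(set(labels))
    scores = []
    for i, (x, lab) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the point's own cluster
        own = [abs(x - y) for j, (y, l) in enumerate(zip(points, labels))
               if l == lab and j != i]
        a = sum(own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(abs(x - y) for y, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in clusters if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
good = mean_silhouette(points, [0, 0, 0, 1, 1, 1])  # labels match the two clumps
bad = mean_silhouette(points, [0, 1, 0, 1, 0, 1])   # labels ignore the clumps
print(round(good, 3), round(bad, 3))
```

A value close to 1 means tight, well-separated clusters, which is why the number of clusters is usually picked near a peak of this curve.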
best_k = 3
kmeans = KMeans(n_clusters=best_k, n_init=50, random_state=109).fit(X.values)
labels = kmeans.labels_
pca = PCA(n_components=2).fit(X)
# project data onto 2D space spanned by components
X_pca = pca.transform(X)
X_pca.shape
(210, 2)
names = tdf['player'].values
plt.figure(figsize=(10,10))
# data points colored by cluster labels
plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=.85,c=plt.cm.Accent(labels))
# annotate the names of the well-known keepers listed in modern_gks
for i in range(X_pca.shape[0]):
    name = names[i]
    if name in modern_gks:
        plt.annotate(name, X_pca[i], size=5)
plt.xlabel(f'PCA1 ({pca.explained_variance_ratio_[0]:.2%} var explained)')
plt.ylabel(f'PCA2 ({pca.explained_variance_ratio_[1]:.2%} var explained)');
plt.title('KMeans Clustering of GKs in the Big 5 European Leagues (PCA Projection)');
With the passing data included, I'm not sure how I feel about De Gea's company! Takeaways:
- Based on this analysis, Dean Henderson might not be the upgrade on De Gea that some people think he could be. He doesn't necessarily bring a different playing style, either.
- Again based on this clustering scheme, the plot gives some idea of how different De Gea's profile is from those of Alisson, Ederson, Onana, Unai Simon, and other more proactive keepers.
- Many top teams seem to have their goalkeepers in the orange and purple groups, not in the group De Gea belongs to.
A look inside the cluster groups
tdf['group'] = labels
tdf.describe()
| | PSx90PrevPer90 | 40+_completion_% | cross_stop_% | sweeper_action_90 | sweep_avg_dist | med_completion_rate | long_completion_rate | final_third | group |
|---|---|---|---|---|---|---|---|---|---|
| count | 210.000000 | 210.00000 | 210.000000 | 210.000000 | 210.000000 | 210.000000 | 210.000000 | 210.000000 | 210.000000 |
| mean | 0.571326 | 0.36829 | 0.240714 | 0.082898 | 0.363228 | 0.855810 | 0.463881 | 0.083824 | 1.090476 |
| std | 0.109178 | 0.09896 | 0.213275 | 0.079246 | 0.124144 | 0.154953 | 0.162471 | 0.122894 | 0.872989 |
| min | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.530997 | 0.31825 | 0.037500 | 0.050181 | 0.298305 | 0.802000 | 0.350417 | 0.007353 | 0.000000 |
| 50% | 0.582210 | 0.35950 | 0.208333 | 0.072202 | 0.362712 | 0.880000 | 0.458333 | 0.044118 | 1.000000 |
| 75% | 0.619272 | 0.41275 | 0.366667 | 0.102527 | 0.430508 | 0.944000 | 0.557083 | 0.117647 | 2.000000 |
| max | 1.000000 | 1.00000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 |
tdf['group'].value_counts()
group
2    90
0    71
1    49
Name: count, dtype: int64
tdf['close'] = tdf['PSx90PrevPer90']
groups = {}
# KMeans label i (0-based) becomes "group{i+1}"
for i in range(3):
    groups[f"group{i+1}"] = tdf[tdf['group'] == i]
group1 = groups['group1'].describe()
group2 = groups['group2'].describe()
group3 = groups['group3'].describe()
# group4 = groups['group4'].describe()
group1.head(8)
| | PSx90PrevPer90 | 40+_completion_% | cross_stop_% | sweeper_action_90 | sweep_avg_dist | med_completion_rate | long_completion_rate | final_third | group | close |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 71.000000 | 71.000000 | 71.000000 | 71.000000 | 71.000000 | 71.000000 | 71.000000 | 71.000000 | 71.0 | 71.000000 |
| mean | 0.583311 | 0.364930 | 0.487793 | 0.086307 | 0.373168 | 0.847662 | 0.439343 | 0.173364 | 0.0 | 0.583311 |
| std | 0.045630 | 0.053626 | 0.149708 | 0.039546 | 0.086193 | 0.078167 | 0.110636 | 0.169668 | 0.0 | 0.045630 |
| min | 0.450135 | 0.236000 | 0.233333 | 0.021661 | 0.169492 | 0.552000 | 0.183333 | 0.000000 | 0.0 | 0.450135 |
| 25% | 0.552561 | 0.323000 | 0.366667 | 0.055596 | 0.313559 | 0.800000 | 0.356667 | 0.077206 | 0.0 | 0.552561 |
| 50% | 0.590296 | 0.366000 | 0.483333 | 0.082310 | 0.386441 | 0.856000 | 0.445000 | 0.117647 | 0.0 | 0.590296 |
| 75% | 0.609164 | 0.402000 | 0.583333 | 0.107942 | 0.430508 | 0.896000 | 0.508333 | 0.205882 | 0.0 | 0.609164 |
| max | 0.671159 | 0.471000 | 1.000000 | 0.257040 | 0.633898 | 1.000000 | 0.673333 | 1.000000 | 0.0 | 0.671159 |
group2.head(8)
| | PSx90PrevPer90 | 40+_completion_% | cross_stop_% | sweeper_action_90 | sweep_avg_dist | med_completion_rate | long_completion_rate | final_third | group | close |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 49.000000 | 49.000000 | 49.000000 | 49.000000 | 49.000000 | 49.000000 | 49.000000 | 49.000000 | 49.0 | 49.000000 |
| mean | 0.568458 | 0.455571 | 0.106122 | 0.100995 | 0.415842 | 0.917224 | 0.663095 | 0.037065 | 1.0 | 0.568458 |
| std | 0.141109 | 0.113240 | 0.109592 | 0.141600 | 0.178160 | 0.081593 | 0.118357 | 0.045795 | 0.0 | 0.141109 |
| min | 0.231806 | 0.297000 | 0.000000 | 0.000000 | 0.016949 | 0.728000 | 0.430000 | 0.000000 | 1.0 | 0.231806 |
| 25% | 0.528302 | 0.394000 | 0.016667 | 0.051264 | 0.349153 | 0.872000 | 0.578333 | 0.007353 | 1.0 | 0.528302 |
| 50% | 0.563342 | 0.445000 | 0.050000 | 0.085921 | 0.433898 | 0.920000 | 0.646667 | 0.014706 | 1.0 | 0.563342 |
| 75% | 0.622642 | 0.500000 | 0.200000 | 0.120578 | 0.505085 | 1.000000 | 0.736667 | 0.051471 | 1.0 | 0.622642 |
| max | 0.878706 | 1.000000 | 0.350000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.161765 | 1.0 | 0.878706 |
group3.head(8)
| | PSx90PrevPer90 | 40+_completion_% | cross_stop_% | sweeper_action_90 | sweep_avg_dist | med_completion_rate | long_completion_rate | final_third | group | close |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 90.000000 | 90.000000 | 90.000000 | 90.000000 | 90.000000 | 90.000000 | 90.000000 | 90.000000 | 90.0 | 90.000000 |
| mean | 0.563432 | 0.323422 | 0.119074 | 0.070357 | 0.326742 | 0.828800 | 0.374778 | 0.038644 | 2.0 | 0.563432 |
| std | 0.124230 | 0.087066 | 0.095254 | 0.048529 | 0.101392 | 0.212416 | 0.120195 | 0.045052 | 0.0 | 0.124230 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.0 | 0.000000 |
| 25% | 0.500000 | 0.291500 | 0.033333 | 0.041155 | 0.283051 | 0.786000 | 0.315417 | 0.007353 | 2.0 | 0.500000 |
| 50% | 0.575472 | 0.333000 | 0.091667 | 0.071480 | 0.340678 | 0.884000 | 0.381667 | 0.022059 | 2.0 | 0.575472 |
| 75% | 0.625337 | 0.360000 | 0.216667 | 0.090794 | 0.383051 | 0.990000 | 0.462500 | 0.044118 | 2.0 | 0.625337 |
| max | 1.000000 | 0.615000 | 0.300000 | 0.361011 | 0.603390 | 1.000000 | 0.618333 | 0.235294 | 2.0 | 1.000000 |
group1.columns
Index(['PSx90PrevPer90', '40+_completion_%', 'cross_stop_%',
'sweeper_action_90', 'sweep_avg_dist', 'med_completion_rate',
'long_completion_rate', 'final_third', 'group', 'close'],
dtype='object')
feats = ['PSx90PrevPer90', '40+_completion_%', 'cross_stop_%',
'sweeper_action_90', 'sweep_avg_dist', 'med_completion_rate',
'long_completion_rate', 'final_third', 'PSx90PrevPer90']
tdf[tdf['player']=='David de Gea']
| | player | PSx90PrevPer90 | 40+_completion_% | cross_stop_% | sweeper_action_90 | sweep_avg_dist | pos | med_completion_rate | long_completion_rate | final_third | group | close |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 60 | David de Gea | 0.584906 | 0.314 | 0.25 | 0.06065 | 0.383051 | GK | 0.84 | 0.393333 | 0.051471 | 2 | 0.584906 |
De Gea is in Group 3; let's see how his group compares with the others
fig4 = go.Figure()
fig4.add_trace(go.Scatterpolar(
r = group1[feats].loc['mean'].values,
theta = feats,
fill='toself',
name='Group 1',
marker_line_width=1.5,
hovertext ='Group 1'
))
fig4.add_trace(go.Scatterpolar(
r=group2[feats].loc['mean'].values,
theta=feats,
# fill='toself',
name='Group 2',
marker_line_width=.15,
hovertext='Group 2'
))
fig4.add_trace(go.Scatterpolar(
r=group3[feats].loc['mean'].values,
theta=feats,
# fill='toself',
name='Group 3',
marker_line_width=1.5,
hovertext = 'Group 3'
))
# fig4.add_trace(go.Scatterpolar(
# r=group4[feats].loc['mean'].values,
# theta=feats,
# # fill='toself',
# name='Group 4',
# marker_line_width=1.5,
# hovertext = 'Group 4'
# ))
fig4.update_layout(
polar=dict(
radialaxis=dict(
visible=False
)),
showlegend=True
)
fig4.update_layout(title_text='Comparison of (Normalized) Average of Advanced Stats among Profile Groups (Big 5 European Leagues, 2022-2023)',height=600)
fig4.show()
groups['group1']['player'].values
array(['Alisson', 'Emil Audero', 'Édgar Badía', 'Oliver Baumann',
'Gavin Bazunu', 'Paul Bernardoni', 'Marco Bizot', 'Janis Blaswich',
'Yassine Bounou', 'Marco Carnesecchi', 'Koen Casteels',
'Lucas Chevalier', 'Oliver Christensen', 'Andrea Consigli',
'Thibaut Courtois', 'Mory Diaw', 'Stole Dimitrievski',
'Yehvann Diouf', 'Gianluigi Donnarumma', 'Maxime Dupé',
'Łukasz Fabiański', 'Wladimiro Falcone', 'Aitor Fernández',
'Fernando', 'Mark Flekken', 'Gauthier Gallon', 'Paulo Gazzaniga',
'Rafał Gikiewicz', 'Ivo Grbić', 'Vicente Guaita', 'Dean Henderson',
'Sergio Herrera', 'Lukáš Hrádecký', 'Alban Lafont',
'Jeremías Ledesma', 'Bernd Leno', 'Benjamin Leroy', 'Hugo Lloris',
'Anthony Lopes', 'Pau López', 'Giorgi Mamardashvili',
'Steve Mandanda', 'Emiliano Martínez', 'Illan Meslier',
'Vanja Milinković-Savić', 'Lorenzo Montipò', 'Yvon Mvogo', 'Neto',
'Alexander Nübel', 'Jan Oblak', 'Jiří Pavlenka', 'Jordan Pickford',
'Nick Pope', 'Predrag Rajković', 'Aaron Ramsdale', 'David Raya',
'Álex Remiro', 'Manuel Riemann', 'Frederik Rønnow', 'José Sá',
'Brice Samba', 'Kasper Schmeichel', 'Marvin Schwäbe', 'Matz Sels',
'Rui Silva', 'Unai Simón', 'Łukasz Skorupski', 'David Soria',
'Guglielmo Vicario', 'Danny Ward', 'Robin Zentner'], dtype=object)
groups['group2']['player'].values
array(['Álvaro Aceves', 'Kepa Arrizabalaga', 'Fabian Bredlow',
'Juan Carlos', 'Michele Cerofolini', 'Alessio Cragno',
'Rémy Descamps', 'Martin Dúbravka', 'Ederson', 'Álvaro Fernández',
'Joan García', 'Pierluigi Gollini', 'Pierluigi Gollini',
'Pierluigi Gollini', 'Pierluigi Gollini', 'Dominik Greif',
'Péter Gulácsi', 'Samir Handanović', 'Caoimhín Kelleher',
'Gregor Kobel', 'Jean-Louis Leca', 'Benjamin Lecomte',
'Andriy Lunin', 'Mike Maignan', 'Federico Marchetti',
'Diego Mariño', 'Alex Meret', 'Alexander Meyer', 'Florian Müller',
'Manuel Neuer', 'André Onana', 'Stefan Ortega', 'Fernando Pacheco',
'Rui Patrício', 'Gianluca Pegolo', 'Iñaki Peña', 'Ivan Provedel',
'Leonardo Román', 'Gerónimo Rulli', 'Mouhamadou Sarr',
'Salvatore Sirigu', 'Yann Sommer', 'Yann Sommer', 'Jason Steele',
'Wojciech Szczęsny', 'Ciprian Tătărușanu', 'Marc-André ter Stegen',
'Pietro Terracciano', 'Michael Zetterer'], dtype=object)
groups['group3']['player'].unique()
array(['Julen Agirrezabala', 'Doğan Alemdar', 'Alphonse Areola',
'Sergio Asenjo', 'Asmir Begović', 'Daniel Bentley', 'Rubén Blanco',
'Joaquín Blázquez', 'Claudio Bravo', 'Marcin Bułka',
'Matis Carvalho', 'Benoît Costil', 'Finn Dahmen',
'Michele Di Gregorio', 'Ouparine Djoco', 'Marko Dmitrović',
'Bartłomiej Drągowski', 'Tjark Ernst', 'Ralf Fährmann',
'Vincenzo Fiorillo', 'Yahia Fofana', 'Fraser Forster',
'David de Gea', 'David Gil', 'Lennart Grill', 'Wayne Hennessey',
'Daniel Iversen', 'Sam Johnstone', 'Filip Jørgensen',
'Bingourou Kamara', 'Tomáš Koubek', 'Benjamin Lecomte',
'Donovan Léon', 'Mateusz Lis', 'Diego López', 'Andrey Lunyov',
'Vito Mannone', 'Agustín Marchesín', 'Jordi Masip',
'Alex McCarthy', 'Edouard Mendy', 'Juan Musso', 'Keylor Navas',
'Ørjan Nyland', 'Guillermo Ochoa', 'Jan Olschowsky', 'Robin Olsen',
'Jonas Omlin', 'Fernando Pacheco', 'Patrick Pentz',
'Simone Perilli', 'Mattia Perin', 'Samuele Perisan',
'Ghjuvanni Quilichini', 'Ionuț Radu', 'Diant Ramaj',
'Nicola Ravaglia', 'Pepe Reina', 'Rémy Riou', 'Joel Robles',
'Marek Rodák', 'Alessandro Russo', 'Alexander Schwolow',
'Luigi Sepe', 'Marco Silvestri', 'Tobias Sippel',
'François-Joseph Sollacaro', 'Yann Sommer', 'Marco Sportiello',
'Mile Svilar', 'Kevin Trapp', 'Mark Travers', 'Martin Turk',
'Sven Ulreich', 'Iván Villar', 'Axel Werner', 'Joseph Whitworth',
'Jeroen Zoet', 'Petar Zovko'], dtype=object)
Based on this clustering analysis, the three groups can be described as follows:
- Group 1 seems to include proactive keepers who claim the most crosses, make a decent number of defensive actions, and are okay passers, e.g. Alisson, Courtois, Nick Pope, Pickford, Ramsdale, Unai Simon
- Group 2 seems to include keepers who are great passers with okay proactive, defensive actions, e.g. Ederson, Jason Steele, ter Stegen, Neuer
- Group 3 seems to include keepers who aren't particularly great at anything
- The difference is minimal, but Group 3 also has the worst average goal prevention based on PSxG.
- Group 3 keepers have the worst average for every metric except crosses stopped and passes into the final third.
- De Gea, mind you, belongs to Group 3. Make of that what you will!
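The claim about Group 3 can be checked against the `mean` rows of the three `describe()` tables above. The sketch below copies those means by hand (rounded to four decimals) and reports which group has the lowest average for each metric; the `group_means` dictionary is written out manually here rather than computed from `tdf`.

```python
metrics = ["PSx90PrevPer90", "40+_completion_%", "cross_stop_%",
           "sweeper_action_90", "sweep_avg_dist", "med_completion_rate",
           "long_completion_rate", "final_third"]

# 'mean' rows of group1, group2 and group3, copied from the tables above
group_means = {
    "Group 1": [0.5833, 0.3649, 0.4878, 0.0863, 0.3732, 0.8477, 0.4393, 0.1734],
    "Group 2": [0.5685, 0.4556, 0.1061, 0.1010, 0.4158, 0.9172, 0.6631, 0.0371],
    "Group 3": [0.5634, 0.3234, 0.1191, 0.0704, 0.3267, 0.8288, 0.3748, 0.0386],
}

# for each metric, find the group with the lowest mean
lowest = {}
for idx, metric in enumerate(metrics):
    lowest[metric] = min(group_means, key=lambda g: group_means[g][idx])
    print(f"{metric}: lowest mean in {lowest[metric]}")

# metrics where Group 3 is NOT the lowest
not_group3 = [m for m, g in lowest.items() if g != "Group 3"]
print(not_group3)
```

Only the crosses-stopped and final-third metrics have a lower mean in another group (Group 2 in both cases), and the gaps there are small.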
tdf.head()
| | player | PSx90PrevPer90 | 40+_completion_% | cross_stop_% | sweeper_action_90 | sweep_avg_dist | pos | med_completion_rate | long_completion_rate | final_third | group | close |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Álvaro Aceves | 0.832884 | 0.500 | 0.000000 | 1.000000 | 1.000000 | GK | 1.000 | 0.500000 | 0.000000 | 1 | 0.832884 |
| 1 | Julen Agirrezabala | 0.574124 | 0.366 | 0.183333 | 0.096029 | 0.403390 | GK | 0.744 | 0.386667 | 0.014706 | 2 | 0.574124 |
| 2 | Doğan Alemdar | 0.504043 | 0.343 | 0.083333 | 0.080144 | 0.383051 | GK | 1.000 | 0.273333 | 0.014706 | 2 | 0.504043 |
| 3 | Alisson | 0.663073 | 0.410 | 0.383333 | 0.174007 | 0.552542 | GK | 0.888 | 0.636667 | 0.117647 | 0 | 0.663073 |
| 4 | Alphonse Areola | 0.614555 | 0.378 | 0.033333 | 0.020939 | 0.230508 | GK | 1.000 | 0.403333 | 0.000000 | 2 | 0.614555 |
tdf.columns
Index(['player', 'PSx90PrevPer90', '40+_completion_%', 'cross_stop_%',
'sweeper_action_90', 'sweep_avg_dist', 'pos', 'med_completion_rate',
'long_completion_rate', 'final_third', 'group', 'close'],
dtype='object')
tdf.loc[tdf['player']=='Aaron Ramsdale'].values[0][1:]
array([0.5768194070080862, 0.254, 0.36666666666666664, 0.0815884476534296,
0.42711864406779665, 'GK', 0.8560000000000008, 0.26333333333333325,
0.22058823529411764, 0, 0.5768194070080862], dtype=object)
import plotly.express as px
Percentile Visualizer
def pv(player):
    # note: the plotted values are min-max scaled stats on a 0-100 scale,
    # not true percentile ranks
    df = tdf.loc[tdf['player']==player]
    df = df[['PSx90PrevPer90', '40+_completion_%', 'cross_stop_%',
             'sweeper_action_90', 'sweep_avg_dist', 'med_completion_rate',
             'long_completion_rate', 'final_third']]
    pcts = df.values[0]*100
    fig, ax = plt.subplots()
    ax.barh(df.columns, pcts)
    ax.set_ylabel('Attributes')
    ax.set_title(f'{player} Attribute Percentile')
    # label each bar with its value
    for index, value in enumerate(pcts):
        value = round(value,2)
        ax.text(value, index, str(value))
    plt.show()
tdf['player']
0 Álvaro Aceves
1 Julen Agirrezabala
2 Doğan Alemdar
3 Alisson
4 Alphonse Areola
...
205 Joseph Whitworth
206 Robin Zentner
207 Michael Zetterer
208 Jeroen Zoet
209 Petar Zovko
Name: player, Length: 210, dtype: object
pv('Jason Steele')
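Note that `pv` plots min-max scaled values, which are not the same as percentiles: a single extreme outlier can squeeze everyone else into a narrow band. If true percentile ranks are wanted, one possible approach (an assumption about the intent, not what the notebook does) is to rank each keeper within the sample. The helpers below are a hypothetical standard-library sketch; the values are illustrative, with the 13.85 outlier mirroring the largest #OPA/90 shown in the tables earlier.

```python
def percentile_rank(values, x):
    """Percentage of values that are less than or equal to x."""
    return 100 * sum(v <= x for v in values) / len(values)

def min_max_pct(values, x):
    """Min-max scaled position of x within values, on a 0-100 scale."""
    lo, hi = min(values), max(values)
    return 100 * (x - lo) / (hi - lo)

# illustrative sweeper actions per 90, with one extreme outlier
sweeper_per_90 = [0.3, 0.7, 1.1, 1.3, 1.6, 2.4, 13.85]

x = 1.6
print(round(min_max_pct(sweeper_per_90, x), 1))      # squeezed by the outlier
print(round(percentile_rank(sweeper_per_90, x), 1))  # rank within the sample
```

In pandas, `Series.rank(pct=True)` gives a comparable ranking for a whole column at once.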